Existing federated classification algorithms typically assume that the local annotations at every client cover the same set of classes. In this paper, we aim to lift this assumption and focus on a more general yet practical non-IID setting where every client can work on non-identical and even disjoint sets of classes (i.e., client-exclusive classes), and the clients share a common goal: to build a global classification model that identifies the union of these classes. Such heterogeneity in client class sets poses a new challenge: how can we ensure that different clients operate in the same latent space, so as to avoid drift after aggregation? We observe that the classes can be described in natural language (i.e., class names) and that these names are typically safe to share with all parties. We therefore formulate the classification problem as a matching process between data representations and class representations, and break the classification model into a data encoder and a label encoder. We leverage the natural-language class names as common ground to anchor the class representations in the label encoder. In each iteration, the label encoder updates the class representations and regulates the data representations through matching. We further use the updated class representations at each round to annotate data samples for locally-unaware classes according to similarity, and distill this knowledge into the local models. Extensive experiments on four real-world datasets show that the proposed method outperforms various classical and state-of-the-art federated learning methods designed for learning with non-IID data.
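As a rough illustration of the matching formulation described above, the following is a minimal NumPy sketch under assumed shapes and a hypothetical confidence threshold; it is not the paper's implementation. The embedding matrices stand in for the outputs of the data encoder and the label encoder, and classification reduces to similarity matching between the two.

```python
import numpy as np

rng = np.random.default_rng(0)

def l2_normalize(x, axis=-1):
    return x / np.linalg.norm(x, axis=axis, keepdims=True)

# Hypothetical stand-ins for the two encoders: a data encoder mapping
# samples to embeddings, and a label encoder mapping class-name text
# into the same latent space.
data_emb = l2_normalize(rng.normal(size=(4, 16)))    # 4 local samples
class_emb = l2_normalize(rng.normal(size=(10, 16)))  # 10 class names (global)

# Classification as matching: logits are cosine similarities between
# data representations and class representations.
logits = data_emb @ class_emb.T          # shape (4, 10)
pred = logits.argmax(axis=1)             # predicted class per sample

# Pseudo-labeling for locally-unaware classes: accept a match only when
# the best similarity clears a (hypothetical) confidence threshold.
tau = 0.3
confident = logits.max(axis=1) > tau
```

Because every client matches against class representations anchored to the same shared class names, the clients operate in a common latent space even when their local label sets are disjoint.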
We present a method for introducing a text encoder into pre-trained end-to-end speech translation systems. It enhances the system's ability to adapt one modality (i.e., source-language speech) to another (i.e., source-language text). The speech translation model can thus learn from both unlabeled and labeled data, which is especially useful when source-language text data is abundant. Beyond this, we present a denoising method for building a robust text encoder that can handle both normal and noisy text data. Our system sets a new state of the art on the MuST-C En-De, En-Fr, and LibriSpeech En-Fr tasks.
The essential task of urban planning is to generate the optimal land-use configuration of a target area. However, traditional urban planning is time-consuming and labor-intensive. Deep generative learning offers hope that we can automate this planning process and produce ideal urban plans. While remarkable achievements have been made, existing approaches exhibit limitations in their lack of awareness of: 1) the hierarchical dependencies between functional zones and spatial grids; 2) the peer dependencies among functional zones; and 3) human regulations that ensure the usability of generated configurations. To address these limitations, we develop a novel human-instructed deep hierarchical generative model. We rethink the urban-planning generative task from a unique functionality perspective, where we summarize planning requirements into different functionality projections for better urban plan generation. To this end, we develop a three-stage generation process from a target area to zones to grids. The first stage labels the grids of a target area with latent functionalities to discover functional zones. The second stage perceives the planning requirements to form urban functionality projections. We propose a novel module, the functionalizer, which projects the embedding of human instructions and geospatial contexts onto the zone-level plan to obtain such projections. Each projection includes the information of land-use portfolios and the structural dependencies across spatial grids with respect to a specific urban function. The third stage leverages multiple attention mechanisms to model the zone-zone peer dependencies of the functionality projections and generate grid-level land-use configurations. Finally, we present extensive experiments to demonstrate the effectiveness of our framework.
Due to the unavoidable noise introduced during scanning and quantization, 3D reconstruction with RGB-D sensors suffers from errors in both geometry and texture, leading to artifacts such as camera drift, mesh distortion, texture ghosting, and blurriness. Given an imperfect reconstructed 3D model, most previous methods have focused on refining either the geometry, the texture, or the camera pose alone; alternatively, prior joint-optimization methods used different optimization schemes and objectives for each component, resulting in complex systems. In this paper, we propose a novel optimization approach based on differentiable rendering, which integrates the optimization of camera pose, geometry, and texture into a unified framework by enforcing consistency between the rendered results and the corresponding RGB-D inputs. On top of this unified framework, we introduce a joint optimization method that fully exploits the mutual relationships among geometry, texture, and camera pose, and describe an adaptive interleaving strategy that improves optimization stability and efficiency. Using differentiable rendering, an image-level adversarial loss is applied to further refine the 3D model, making it more photorealistic. Experiments on synthetic and real data, with quantitative and qualitative evaluations, demonstrate the superiority of our approach in recovering high-quality geometry and high-fidelity texture.
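The unified consistency objective at the core of the abstract above can be sketched as follows; this is a toy NumPy illustration with synthetic arrays and a hypothetical depth weight, not the paper's differentiable renderer. In the actual method, the rendering step is differentiable, so gradients of this kind of loss can update camera pose, geometry, and texture jointly.

```python
import numpy as np

rng = np.random.default_rng(4)
H, W = 8, 8

# Hypothetical observed RGB-D frame and the current rendering of the
# reconstructed model from the estimated camera pose.
obs_rgb = rng.uniform(size=(H, W, 3))
obs_depth = rng.uniform(0.5, 2.0, size=(H, W))
ren_rgb = obs_rgb + 0.05 * rng.normal(size=(H, W, 3))
ren_depth = obs_depth + 0.05 * rng.normal(size=(H, W))

# Consistency terms between rendering and observation; lambda_d is a
# hypothetical weight balancing color against depth agreement.
lambda_d = 0.5
loss_rgb = np.mean((ren_rgb - obs_rgb) ** 2)
loss_depth = np.mean((ren_depth - obs_depth) ** 2)
loss = loss_rgb + lambda_d * loss_depth
```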
We propose Few-shot Dynamic Neural Radiance Fields (FDNeRF), the first NeRF-based method capable of reconstructing 3D faces and editing their expressions from a small number of dynamic images. Unlike existing dynamic NeRFs, which require dense images as input and can only model a single identity, our method enables face reconstruction across different identities from few-shot inputs. Compared with state-of-the-art few-shot NeRFs designed for modeling static scenes, the proposed FDNeRF accepts view-inconsistent dynamic inputs and supports arbitrary facial expression editing, i.e., producing faces with novel expressions beyond those in the input. To handle the inconsistencies among dynamic inputs, we introduce a well-designed conditional feature warping (CFW) module that performs expression-conditioned warping in the 2D feature space, and is also identity-adaptive and 3D-constrained. As a result, the features of different expressions are transformed into those of the target expression. We then construct a radiance field from these view-consistent features and use volume rendering to synthesize novel views of the modeled face. Extensive experiments with quantitative and qualitative evaluations demonstrate that our method outperforms existing dynamic and few-shot NeRFs on both 3D face reconstruction and expression editing tasks. Our code and model will be released upon acceptance.
At present, end-to-end open-domain dialogue systems based on deep learning remain black-box models, which makes them prone to generating irrelevant content in a data-driven fashion. Specifically, the latent variables become entangled with different semantics in the latent space due to the lack of prior knowledge to guide training. To address this problem, this paper proposes to harness the generative model with a cognitively inspired approach that introduces mesoscopic-scale feature disentanglement. In particular, the model integrates macro-level guiding-category knowledge and micro-level open-domain dialogue data into training, and injects prior knowledge into the latent space, enabling the model to disentangle the latent variables at the mesoscopic scale. In addition, we propose a new metric for open-domain dialogue that can objectively evaluate the interpretability of the latent-space distribution. Finally, we validate our model on different datasets and experimentally demonstrate that it generates higher-quality and more interpretable dialogue than other models.
Novelty detection aims to automatically identify out-of-distribution (OOD) data without any prior knowledge. It is a critical step in data monitoring, behavior analysis, and other applications, helping to sustain continual learning in the field. Conventional methods of OOD detection perform multi-variate analysis on an ensemble of data or features, and usually resort to supervision with OOD data to improve accuracy. In reality, such supervision is impractical, as one cannot anticipate the anomalous data. In this paper, we propose a novel, self-supervised approach that does not rely on any predefined OOD data: (1) the new approach evaluates the Mahalanobis distance between the in-distribution and OOD data in the gradient space, and (2) it is assisted by a self-supervised binary classifier that guides label selection for generating the gradients and maximizes the Mahalanobis distance. In evaluations on multiple datasets, including CIFAR-10, CIFAR-100, SVHN, and TinyImageNet, the proposed approach consistently outperforms state-of-the-art supervised and unsupervised methods under both the area under the receiver operating characteristic (AUROC) and the area under the precision-recall curve (AUPR) metrics. We further demonstrate that the detector is able to accurately learn one OOD class at a time in continual learning.
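The Mahalanobis-distance scoring in gradient space described above can be sketched in a few lines of NumPy; the gradient vectors here are synthetic stand-ins (in the paper they come from backpropagation guided by the self-supervised classifier), so this is an illustration of the scoring rule, not the full method.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical gradient vectors collected from in-distribution data.
grads_id = rng.normal(loc=0.0, scale=1.0, size=(500, 8))

# Fit the in-distribution gradient statistics.
mu = grads_id.mean(axis=0)
cov = np.cov(grads_id, rowvar=False)
cov_inv = np.linalg.inv(cov + 1e-6 * np.eye(8))  # regularized inverse

def mahalanobis(g):
    """Mahalanobis distance of gradient g from the ID statistics."""
    d = g - mu
    return float(np.sqrt(d @ cov_inv @ d))

# A gradient far from the in-distribution statistics scores higher,
# which is the signal used to flag OOD samples.
score_id = mahalanobis(grads_id[0])
score_ood = mahalanobis(mu + 10.0)  # synthetic far-away gradient
```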
In this paper, we propose a novel architecture, the Enhanced Interactive Transformer (EIT), to address the issue of head degradation in self-attention mechanisms. Our approach replaces the traditional multi-head self-attention mechanism with an Enhanced Multi-Head Attention (EMHA) mechanism, which relaxes the one-to-one mapping constraint between queries and keys, allowing each query to attend to multiple keys. Furthermore, we introduce two interaction models, Inner-Subspace Interaction and Cross-Subspace Interaction, to fully exploit the many-to-many mapping capabilities of EMHA. Extensive experiments on a wide range of tasks (e.g., machine translation, abstractive summarization, grammar correction, language modeling, and automatic brain-disease diagnosis) show its superiority, with only a very modest increase in model size.
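To make the "relaxed one-to-one mapping" concrete, the NumPy sketch below contrasts standard multi-head attention, where query head h only looks at key/value head h, with a many-to-many relaxation in the spirit of EMHA, where every query head attends to every key head. The shapes and the mixing of the resulting outputs are assumptions for illustration, not the paper's exact formulation.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(2)
T, H, d = 5, 2, 4            # sequence length, heads, per-head dim
Q = rng.normal(size=(H, T, d))
K = rng.normal(size=(H, T, d))
V = rng.normal(size=(H, T, d))

# Standard multi-head attention: one-to-one pairing of subspaces,
# query head h attends only within key/value head h.
std_out = np.stack([
    softmax(Q[h] @ K[h].T / np.sqrt(d)) @ V[h] for h in range(H)
])

# Many-to-many relaxation: every query head attends to every key head,
# yielding H*H interaction outputs that a model could then combine.
relaxed_out = np.stack([
    softmax(Q[i] @ K[j].T / np.sqrt(d)) @ V[j]
    for i in range(H) for j in range(H)
])
```

The H*H interaction outputs are exactly what the two interaction models (inner-subspace, i == j; cross-subspace, i != j) would then operate over.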
Document images are a ubiquitous source of data where the text is organized in a complex hierarchical structure ranging from fine granularity (e.g., words), through medium granularity (e.g., regions such as paragraphs or figures), to coarse granularity (e.g., the whole page). The spatial hierarchical relationships between content at different levels of granularity are crucial for document image understanding tasks. Existing methods learn features at either the word level or the region level, but fail to consider both simultaneously. Word-level models are restricted by the fact that they originate from pure-text language models, which only encode word-level context. In contrast, region-level models attempt to encode regions corresponding to paragraphs or text blocks into a single embedding, but they perform worse with additional word-level features. To deal with these issues, we propose MGDoc, a new multi-modal multi-granular pre-training framework that encodes page-level, region-level, and word-level information at the same time. MGDoc uses a unified text-visual encoder to obtain multi-modal features across different granularities, which makes it possible to project the multi-granular features into the same hyperspace. To model the region-word correlation, we design a cross-granular attention mechanism and specific pre-training tasks that reinforce the model's learning of the hierarchy between regions and words. Experiments demonstrate that our proposed model learns better features that perform well across granularities and lead to improvements in downstream tasks.
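A minimal NumPy sketch of cross-granular attention as described above: each word queries the region embeddings, so a word's representation is enriched with the context of the region (paragraph, figure, etc.) it likely belongs to. The feature matrices stand in for outputs of the shared text-visual encoder; dimensions are assumed for illustration.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(3)
n_words, n_regions, d = 12, 3, 16

# Hypothetical multi-modal features already projected into one common
# space by the unified text-visual encoder.
word_feat = rng.normal(size=(n_words, d))
region_feat = rng.normal(size=(n_regions, d))

# Cross-granular attention: words (queries) attend over regions
# (keys/values), modeling the region-word correlation.
attn = softmax(word_feat @ region_feat.T / np.sqrt(d))  # (12, 3)
word_ctx = attn @ region_feat                            # (12, 16)
```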
The spread of misinformation is a prominent problem in today's society, and many researchers in academia and industry are trying to combat it. Due to the vast amount of misinformation created every day, it is unrealistic to leave this task to human fact-checkers. Data scientists and researchers have been working on automated misinformation detection for years, and it remains a challenging problem today. The goal of our research is to add a new level to automated misinformation detection: classifying segments of text by the persuasive writing techniques they use, in order to produce interpretable reasoning for why an article can be marked as misinformation. To accomplish this, we present a novel annotation scheme covering many common persuasive writing tactics, along with a correspondingly human-annotated dataset. For this task, we use a RoBERTa model for text classification, due to its strong performance on NLP tasks. We develop several language-model-based baselines and present the results of our persuasive-strategy label predictions, as well as the improvements these intermediate labels provide in detecting misinformation and producing interpretable results.